For the purposes of the current exercises, select a stream from a rainy area with relatively small discharge so that the effect of short but strong storms is visible. Good choices are small rivers from the north-eastern US, e.g. site 01589440. Retrieve at least 10 years of data.
Large amounts of historical surface water data are available from the United States Geological Survey (USGS) at https://waterdata.usgs.gov/nwis . The goal of the project is to retrieve samples from the web interface manually, and then later automate the process by calling the web service as described at https://help.waterdata.usgs.gov/faq/automated-retrievals.
You can also use the dataset from the /home/course/Datasets/02-data/ folder.
Find a database of historic water level measurements of the Danube at Budapest.
The reason for this is that I may work offline, with the Jupyter notebook downloaded. This makes sure that the data gets downloaded next to the notebook, regardless of where the notebook is, because it checks its own location.
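A minimal sketch of how a "next to the notebook" data folder could be resolved. The helper name `data_dir` and the folder name `data` are hypothetical, and the sketch assumes that the working directory of the kernel is the folder holding the notebook, which is the case for a default Jupyter launch but not guaranteed in general.

```python
from pathlib import Path


def data_dir(name="data", base=None):
    """Return (and create if needed) a data folder next to the notebook."""
    # Assumption: the kernel's working directory is the notebook's folder.
    root = Path(base) if base is not None else Path.cwd()
    target = root / name
    # exist_ok=True makes repeated runs of the cell harmless.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling `data_dir()` at the top of the notebook gives a path that later download and load cells can share.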
Download the data
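The automated retrieval could look roughly like the sketch below. It assumes the USGS instantaneous values endpoint at `https://waterservices.usgs.gov/nwis/iv/` with the query parameters `format`, `sites`, `startDT`, `endDT` and `parameterCd`; these should be checked against the automated-retrievals documentation before use. The function names are hypothetical, and parameter code `00060` is taken to mean discharge, as the column names in the output further down suggest.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Assumed endpoint of the USGS instantaneous values service.
BASE_URL = "https://waterservices.usgs.gov/nwis/iv/"


def build_iv_url(site, start, end, parameter="00060"):
    """Build a request URL for one site and one date range (YYYY-MM-DD)."""
    query = {
        "format": "rdb",           # tab-separated text with '#' comment lines
        "sites": site,
        "startDT": start,
        "endDT": end,
        "parameterCd": parameter,  # 00060: discharge (assumption, see above)
    }
    return BASE_URL + "?" + urlencode(query)


def download(site, start, end, target):
    """Fetch the data into the file `target` (needs network access)."""
    urlretrieve(build_iv_url(site, start, end), target)
    return target
```

Requesting one year per call, as the file name in the log below suggests, keeps each response small and makes it easy to resume after a failed download.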
Now that we have the data in a usable form, the column names need to be changed to more understandable ones.
Load the downloaded data file into the processing environment, paying attention to handling time stamps and performing the necessary data type conversions. Converting dates to floating point numbers such as Unix time stamps or Julian dates usually makes handling time series easier. Plot the data for a certain interval to show that the effect of storms is clearly visible.
Opened: /v/wfct0p/Dataexp-archive/data-exp-vis-2021-solutions/A-02-Timeseries/01471875-data/01471875-20190101-20191231.csv Loading Done Change column names from: ['agency_cd', 'site_no', 'datetime', 'tz_cd', '121460_00065', '121460_00065_cd', '121461_00060', '121461_00060_cd', '240324_00010', '240324_00010_cd'] to ['Source', 'Site_No.', 'Date', 'Timezone', 'Gage_height', 'A1', 'Discharge', 'A2']
| | Source | Site_No. | Date | Timezone | Gage_height | A1 | Discharge | A2 | U1 | U2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | USGS | 01471875 | 17897.000000 | EST | 4.69 | A | 723.0 | A | 5.6 | A |
| 2 | USGS | 01471875 | 17897.010417 | EST | 4.78 | A | 773.0 | A | 5.5 | A |
| 3 | USGS | 01471875 | 17897.020833 | EST | 4.86 | A | 824.0 | A | 5.5 | A |
| 4 | USGS | 01471875 | 17897.031250 | EST | 4.93 | A | 871.0 | A | 5.5 | A |
| 5 | USGS | 01471875 | 17897.041667 | EST | 5.01 | A | 926.0 | A | 5.5 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35032 | USGS | 01471875 | 18261.947917 | EST | 3.03 | A | 106.0 | A | 6.1 | A |
| 35033 | USGS | 01471875 | 18261.958333 | EST | 3.03 | A | 106.0 | A | 6.0 | A |
| 35034 | USGS | 01471875 | 18261.968750 | EST | 3.02 | A | 104.0 | A | 6.0 | A |
| 35035 | USGS | 01471875 | 18261.979167 | EST | 3.02 | A | 104.0 | A | 6.0 | A |
| 35036 | USGS | 01471875 | 18261.989583 | EST | 3.02 | A | 104.0 | A | 6.0 | A |
35036 rows × 10 columns
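The loading step can be sketched as follows. This is not the original loader: it assumes the tab-separated RDB layout (comment lines starting with `#`, a header row, then a format row), renames columns by their parameter-code suffix, and converts time stamps to fractional days since 1970-01-01, which matches the `Date` values in the table above. The sample rows and the name `load_rdb` are illustrative, and the time zone column is ignored.

```python
import io

import pandas as pd

# Tiny illustrative sample in the assumed RDB layout.
SAMPLE = (
    "# sample comment line\n"
    "agency_cd\tsite_no\tdatetime\ttz_cd\t121461_00060\t121461_00060_cd\n"
    "5s\t15s\t20d\t6s\t14n\t10s\n"
    "USGS\t01471875\t2019-01-01 00:00\tEST\t723\tA\n"
    "USGS\t01471875\t2019-01-01 00:15\tEST\t773\tA\n"
)

FIXED_NAMES = {"agency_cd": "Source", "site_no": "Site_No.",
               "datetime": "Date", "tz_cd": "Timezone"}


def _rename(col):
    """Map raw column names to readable ones; unknown columns are kept."""
    if col in FIXED_NAMES:
        return FIXED_NAMES[col]
    if col.endswith("_00060"):
        return "Discharge"
    if col.endswith("_00065"):
        return "Gage_height"
    return col


def load_rdb(source):
    # Read everything as text so site numbers keep their leading zeros.
    df = pd.read_csv(source, sep="\t", comment="#", dtype=str)
    df = df.iloc[1:].reset_index(drop=True)  # drop the format row
    df = df.rename(columns=_rename)
    stamps = pd.to_datetime(df["Date"])
    # Fractional days since the Unix epoch.
    df["Date"] = (stamps - pd.Timestamp("1970-01-01")) / pd.Timedelta(days=1)
    df["Discharge"] = pd.to_numeric(df["Discharge"], errors="coerce")
    return df


df = load_rdb(io.StringIO(SAMPLE))
```

Using `errors="coerce"` turns non-numeric discharge entries into NaN instead of stopping the load.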
Plot the histogram of water discharge values. Fit the data with an appropriate distribution function and bring arguments in favor of the choice of function.
NOTE: as can be seen, I tried fitting a number of distributions, going from the less heavy-tailed ones to the heavier-tailed ones.
NOTE2: Probably some of the frequent (small) values should be omitted so we can get a more accurate fit to describe the tail.
CHOICE: earlier a lot of distributions were given to choose from. The best I was able to achieve with those was with the Cauchy distribution, which fits the peak well, but not the tail. After a bit of research I found the Burr distribution, which I think is the best fit for this data.
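One way to put numbers behind such a choice is to fit every candidate by maximum likelihood and compare the log-likelihood and the Kolmogorov-Smirnov distance. The sketch below is an assumption about how this could be done, not the code used for the fits above; `compare_fits` and `CANDIDATES` are hypothetical names. The text does not say which Burr variant was used: `scipy.stats.burr12` is Type XII, while `scipy.stats.burr` is Type III.

```python
import numpy as np
from scipy import stats


def compare_fits(values, candidates):
    """Fit each candidate by maximum likelihood and score the result."""
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]  # NaNs would break the fit
    results = {}
    for name, (dist, fit_kwargs) in candidates.items():
        params = dist.fit(values, **fit_kwargs)
        loglik = float(np.sum(dist.logpdf(values, *params)))
        ks = stats.kstest(values, dist.cdf, args=params).statistic
        results[name] = {"params": params, "loglik": loglik, "ks": float(ks)}
    return results


# Location fixed at zero for the positive-support candidates (assumption:
# discharge is strictly positive).
CANDIDATES = {
    "cauchy": (stats.cauchy, {}),
    "lognorm": (stats.lognorm, {"floc": 0}),
    "burr12": (stats.burr12, {"floc": 0}),
}
```

A higher log-likelihood and a smaller KS distance both point to a better fit. The KS distance is dominated by the bulk of the data, so the tail still deserves a separate look on a log-scaled histogram.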
NOTE: scipy.signal.find_peaks could find all the peaks.
NOTE2: the peaks are mostly at the end of rainy events or rainy periods, but I cannot make a precise assumption from them alone. A small river like this one reacts to rain almost instantly, but not quite instantly. This means that there is a small shift between the rain and the increase in water discharge.
As can be seen, a lot of "peaks" are found, but most of them are not real peaks: they just come from noise-like regions. At this point, data smoothing is needed. One has to be careful with it, because it can move the real peaks slightly away from their original position. After smoothing, any peak finder can be used to detect the peaks.
Now I need to switch find_peaks with my own peak detector...
NOTE: after many, many tries, it is apparent that there is no peak finder that could find all peaks (and reject all non-peaks). The problem is that there are very big peaks and small ones. Some of them are distorted and some have NaNs around them. (The latter makes it hard to use a correlation function for peak confirmation.) Even scipy.signal.find_peaks has issues finding all of them. Maybe wavelets? I spent way too much time trying to resolve this issue, and in the end what I am going to do is use savgol_filter and find_peaks from scipy.
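The combination mentioned above can be sketched like this. The window length, polynomial order and prominence threshold are illustrative values rather than the ones used in the notebook, and filling NaN gaps by linear interpolation is an added assumption so that the filter has no missing values to work on.

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter


def detect_peaks(y, window_length=31, polyorder=3, prominence=1.0):
    """Smooth a series and return (peak indices, smoothed series)."""
    y = np.asarray(y, dtype=float)
    idx = np.arange(y.size)
    good = np.isfinite(y)
    # Fill NaN gaps by linear interpolation before filtering.
    filled = np.interp(idx, idx[good], y[good])
    # Savitzky-Golay: local polynomial fit; window_length must be odd.
    smooth = savgol_filter(filled, window_length, polyorder)
    # Prominence rejects small wiggles that survive the smoothing.
    peaks, _ = find_peaks(smooth, prominence=prominence)
    return peaks, smooth
```

Because discharge peaks span several orders of magnitude, running the detector on the logarithm of the discharge may make a single prominence threshold work for both large and small events; that variant is untested here.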
Water discharge increases significantly during rain producing maxima in the time series. Plot the distribution of maximum values and fit with an appropriate function. Bring arguments to support the choice of probabilistic model.
It seems like they are overlapping; there is only a very small difference, about $-10^{-9}$. In the end, an asymmetric and heavy-tailed distribution can be fitted correctly. Long rains at this river are rare, and the shorter the rain is, the lower the peak connected to it is. This rare occurrence of high values means that a power-law-like function will fit well.
Once rainy events are detected, plot the distribution of the length of sunny intervals between rains. Fit the distribution with an appropriate function.
NOTE: we already know their indices, given by find_peaks. These can easily be used to look up their dates, and then we just have to calculate the distances between these dates. Same histogram, same plot, maybe a different distribution. There is always a visible end (the peak) of a rainy event. There cannot be two rains at once, and for the next peak to be visible, a valley has to be present. So this must follow a distribution that has a defined center. Also, almost a year is missing from my data.
Min: 0.07291666697710752 days Max: 7.708333333488554 days
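A minimal sketch of the interval calculation, assuming `times` holds the dates as fractional days and `peak_indices` comes from the peak finder. The function name is hypothetical. It measures peak-to-peak distances, which is the proxy for sunny intervals described in the note above.

```python
import numpy as np


def gaps_between_events(times, peak_indices):
    """Return the time differences between consecutive detected peaks."""
    times = np.asarray(times, dtype=float)
    event_times = np.sort(times[np.asarray(peak_indices, dtype=int)])
    # Units follow `times`: days here.
    return np.diff(event_times)
```

Gaps that span missing data (such as the missing year mentioned above) are not real dry periods and should be dropped before fitting, for example by discarding values above a chosen cut-off.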
The best I was able to achieve is with the $\textbf{Cauchy}$ distribution, and it is still arguable how good the fit is. The Burr distribution fits the tail and the peak well, but gives zero before the peak, so it is not good. The others are even worse, even the Poisson distribution.
What is the maximum of water discharge in an arbitrarily chosen period of one year? Calculate the maximum of water discharge due to rain in a rolling window of 1 year, plot its distribution and fit with an appropriate function.
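The rolling one-year maximum can be sketched with a time-based pandas window. This assumes the discharge is held in a `pandas.Series` with a sorted `DatetimeIndex`; the name `rolling_annual_max` is hypothetical, and the year is approximated as 365 days.

```python
import pandas as pd


def rolling_annual_max(series, window="365D"):
    """Maximum over the trailing window, keeping only full windows."""
    rolled = series.rolling(window).max()
    # Windows that start before the first sample are incomplete: drop them.
    first_full = series.index[0] + pd.Timedelta(window)
    return rolled[rolled.index >= first_full]
```

Consecutive windows overlap almost entirely, so the resulting values are strongly correlated; the histogram shows which maxima dominate, but the samples are far from independent.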
How many times does it rain in a month? (Again, use a rolling window function.) Calculate and plot the distribution and fit with an appropriate function.
Let's choose 30 days as a month!
32160
As can be seen, there are interval lengths which do not appear. If the bin width is smaller, more of these gaps appear.
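For reference, counting rain events in a 30-day rolling window could be done as in the sketch below. It assumes a `DatetimeIndex` for the samples and the integer positions of the detected peaks; `events_per_window` is a hypothetical name and not the code that produced the output above.

```python
import numpy as np
import pandas as pd


def events_per_window(index, peak_indices, window="30D"):
    """Count detected peaks inside a trailing time window at every sample."""
    flags = pd.Series(0, index=index)
    flags.iloc[np.asarray(peak_indices, dtype=int)] = 1  # 1 marks a peak
    counts = flags.rolling(window).sum()
    # Keep only windows that lie fully inside the record.
    first_full = index[0] + pd.Timedelta(window)
    return counts[counts.index >= first_full].astype(int)
```

The counts are integers, so a histogram with one bin per integer avoids empty bins that are purely an artifact of the binning.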
Find the measuring station you used in the exercises above on the map. Find another measurement station about 100-200 km from it and download the data. Try to estimate the typical time it takes for weather fronts to travel the distance between the two measuring stations.
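One possible approach to the travel time, sketched under assumptions: if both stations are resampled to the same regular time grid with no missing values, the lag that maximizes the cross-correlation of the two discharge series gives a rough estimate of the delay. The function name `estimate_lag` is hypothetical.

```python
import numpy as np


def estimate_lag(a, b, dt=1.0):
    """Delay of series `a` relative to series `b`, in units of `dt`.

    A positive result means that events show up in `a` later than in `b`.
    Assumes equal sampling intervals and no NaN values.
    """
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    corr = np.correlate(a, b, mode="full")
    # Index 0 of the 'full' output corresponds to lag -(len(b) - 1).
    lags = np.arange(-(b.size - 1), a.size)
    return lags[np.argmax(corr)] * dt
```

With 15-minute sampling, `dt=0.25` returns the lag in hours. The estimate mixes the front's travel time with differences in how fast each catchment responds to rain, so it is best read as an order of magnitude.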